MASTER'S THESIS Functional Semantic Analysis of Web Pages on the Visual Layer
Authors
Abstract
This master's thesis is motivated by the fact that data records on web pages are structured not only by word content but also by an implied visual hierarchy. A model of this visual hierarchy can help automatic information extraction approaches become more domain independent and more robust against variations in HTML syntax, because the representation of information on the visual layer has to remain fairly constant in order to remain understandable by humans. We refer to this visual layer as the functional level, which expresses the functional support given to humans when information is structured visually. This master's thesis first gives a thorough literature overview of (visual) document analysis and then presents such a functional-level record detection system named REDEVILA (REcord DEtection on the VIsual LAyer). The approach works by superimposing a multi-topological grid onto the visual layer of web pages, which serves as an efficient spatial reasoning data structure for detecting the functional semantics between data items or data records. The system is in principle domain independent as long as the layout hierarchy provided by the web page mainly depends on general topological and geometrical characteristics such as font size, distance and indentation, and not on color properties or word semantics. We further propose a novel diagonal ordering scheme to obtain a more “natural” or human-intuitive ordering, and we demonstrate the concept and problems of the visual-based detection of single records. For the experimental evaluation we selected web pages from four different domains (blogs, search results, personal homepages and online newspapers) to show the basic domain independence of our system. In experiments on 85 web pages, the system achieved fair overall performance.
We conclude that the visual approach, while in its early stages, has the potential to significantly improve the performance and robustness of traditional wrapper systems, to induce a higher level of generalization, and to represent a next step towards generic web wrapping.
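The abstract does not spell out how the multi-topological grid is built, so the following is only a minimal sketch of the general idea of a grid used as a spatial lookup structure over rendered page elements. It is not REDEVILA's implementation: the names `Box` and `GridIndex`, the single uniform cell size, and the margin-based neighbour query are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered element: a name plus its bounding box in pixels."""
    name: str
    x: float
    y: float
    w: float
    h: float


class GridIndex:
    """Uniform grid over the page; each cell stores the boxes that overlap it.

    Assumption: one fixed cell size. The thesis speaks of a multi-topological
    grid, whose exact structure is not given in the abstract.
    """

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(set)

    def _cells_for(self, x0, y0, x1, y1):
        # Enumerate every grid cell touched by the rectangle (x0, y0)-(x1, y1).
        c = self.cell_size
        for gx in range(int(x0 // c), int(x1 // c) + 1):
            for gy in range(int(y0 // c), int(y1 // c) + 1):
                yield (gx, gy)

    def insert(self, box):
        for cell in self._cells_for(box.x, box.y, box.x + box.w, box.y + box.h):
            self.cells[cell].add(box)

    def neighbours(self, box, margin):
        """Return names of candidate neighbours: boxes sharing a cell with
        the query box after it is grown by `margin` pixels on every side."""
        found = set()
        for cell in self._cells_for(box.x - margin, box.y - margin,
                                    box.x + box.w + margin,
                                    box.y + box.h + margin):
            found |= self.cells.get(cell, set())
        found.discard(box)
        return sorted(b.name for b in found)


# Made-up layout: a title directly above its snippet, and a distant footer.
title = Box("title", 10, 10, 200, 20)
snippet = Box("snippet", 10, 35, 300, 40)
footer = Box("footer", 10, 900, 300, 20)

index = GridIndex(cell_size=50)
for b in (title, snippet, footer):
    index.insert(b)

print(index.neighbours(title, margin=10))
print(index.neighbours(footer, margin=10))
```

The point of the grid in this sketch is that a neighbourhood query only inspects the few cells around an element instead of comparing it against every other element on the page. The result is a candidate set at cell granularity; an exact distance check would still be needed afterwards.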
Similar resources
Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems
One of the most famous methods for recommendation is user-based Collaborative Filtering (CF). This system compares the active user’s item ratings with the historical rating records of other users to find similar users, and recommends items that seem interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Inférer des Objets Sémantiques du Web Structuré
This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the growth of the web, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...
Publication date: 2008